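A while ago my manager asked me to look up information on a number of companies. I normally use Aiqicha (爱企查) for this, so I decided to write a program to do the lookup for me. Here is the code: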
import re

import requests

# Company to search for. The original left this as a bare placeholder (公司名称);
# the name used here is the one from the commented-out filter further down.
company_name = "北京蜂盒科技有限公司"
# The trailing "=0" is kept exactly as it appears in the original URL.
url = "https://aiqicha.baidu.com/s?q=" + company_name + "=0"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    # Cookie copied from a logged-in browser session; replace it with your own.
    "Cookie": "BAIDUID=FCA8661E3619BECE060CC564924BCC62:FG=1; PSTM=1598866843; BIDUPSID=E0F38C456F9E422ADF83AC42B7D6101A; BDUSS=WQ0VGd1RFNjMmZsallMY2h0cHpxcGJ3UX4tc000d1RSU3RFaUt0eTE2R1VGSGhmSVFBQUFBJCQAAAAAAAAAAAEAAAA3fsVHxfSzzLrDs7UAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJSHUF-Uh1BfO; BDUSS_BFESS=WQ0VGd1RFNjMmZsallMY2h0cHpxcGJ3UX4tc000d1RSU3RFaUt0eTE2R1VGSGhmSVFBQUFBJCQAAAAAAAAAAAEAAAA3fsVHxfSzzLrDs7UAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJSHUF-Uh1BfO; BDPPN=4a85ba200a8603ef878bc33a1be441f3; log_guid=1a14b30029743b225cc8614df11b9eb2; H_PS_PSSID=7560_32606_1431_32045_32680_32116_31322_32691; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDSFRCVID=B_FOJeC627JtTMnro8G-M4zom7dhgP3TH6aogQEIojxEwhB2gJ6wEG0PeM8g0KAbDINlogKK3gOTH4PF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF=tRKOoILKfIt3fP36qRQj-ICShUFs3qRlB2Q-5KL-JhcMSh6kK4PWQIuIjh6y26bb2IvToMbdJJjoeUjHytn82MLWM-KHKMIqb2TxoUJHBCnJhhvq-xOzX4AebPRiJ-b9Qg-JbpQ7tt5W8ncFbT7l5hKpbt-q0x-jLTnhVn0MBCK0HPonHjKKejoX3f; Hm_lvt_baca6fe3dceaf818f5f835b0ae97e4cc=1599189361,1599210076,1599439817,1599439901; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; delPer=0; PSINO=6; Hm_lpvt_baca6fe3dceaf818f5f835b0ae97e4cc=1599448132",
}

res = requests.get(url, headers=headers)
# Remove every escaped slash "\/" from the page source (so "<\/em>" becomes "<em>").
res = res.text.replace(r'\/', '')
# Turn \uXXXX escape sequences into readable characters.
res = res.encode('utf-8').decode('unicode_escape')
# Grab each chunk of text that sits between {"pid": and }],
res = re.findall(r'{"pid":(.*?)}],', res)

for aa in res:
    # Strip the <em> tags wrapped around parts of the company name.
    aa = aa.replace('<em>', '')
    aa = aa.replace(r'<\/em>', '')
    bb = re.findall(r'"entName":"(.*?)",', aa)   # company name
    cc = re.findall(r'"regCap":"(.*?)",', aa)    # registered capital
    bids = re.findall(r'"bid":"(.*?)",', aa)     # bid value
    # 'username' = company name, 'zijin' = registered capital, 'dizhi' = bid
    gongsiming = {'username': '', 'zijin': '', 'dizhi': ''}
    for ae, ac, bid in zip(bb, cc, bids):
        # if ae == "北京蜂盒科技有限公司":
        #     print(ac)
        gongsiming = {'username': ae, 'zijin': ac, 'dizhi': bid}
        print(gongsiming)
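All I need here is the company name and the registered capital. I don't need the other fields, so the extraction is deliberately simple; if you want other fields, pick them out with another regex in the same way. As for why I used regex: the page source is very complicated. I originally wanted to parse it with json but couldn't work out the JSON structure, and regex gives the same result.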
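To make "pick them out with another regex" concrete, here is a minimal sketch that pulls any field by name, next to a possible json-based alternative. It runs on a made-up sample string rather than a live response. The sample text, the key name resultList, and the helper names extract_field, extract_records and clean_name are all assumptions for illustration; the real Aiqicha page may be structured differently.

```python
import json
import re

# Invented sample that only mimics the "key":"value" shape the regexes rely on.
SAMPLE = (
    'window.pageData = {"resultList":['
    '{"pid":"111","entName":"<em>Example</em> Tech Co.","regCap":"100万","bid":"abc"},'
    '{"pid":"222","entName":"Another Co.","regCap":"500万","bid":"def"}'
    '],"total":2};'
)


def extract_field(text, field):
    """Regex route: return every string value stored under `field`, in order."""
    pattern = r'"%s":"(.*?)"' % re.escape(field)
    return re.findall(pattern, text)


def extract_records(text, key="resultList"):
    """JSON route: decode the first {...} object in `text` and return text[key].

    raw_decode stops at the end of the first complete JSON value, so any
    trailing JavaScript (such as the ';') is ignored.
    """
    start = text.index("{")
    data, _end = json.JSONDecoder().raw_decode(text[start:])
    return data[key]


def clean_name(name):
    """Remove the <em> highlight tags from a company name."""
    return re.sub(r"</?em>", "", name)


if __name__ == "__main__":
    print(extract_field(SAMPLE, "regCap"))
    for record in extract_records(SAMPLE):
        print(clean_name(record["entName"]), record["regCap"], record["bid"])
```

The regex route needs no knowledge of the surrounding structure, which is why it works on a messy page. The JSON route keeps each company's fields together in one dict, so there is no need to zip separate lists, but it only works once you have located where the JSON object starts.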