Since Python handles JSON quite conveniently, let's first test which of the two is faster to load: CSV or JSON.
First, generate some test data:
# coding: utf-8
import json
import csv
import random
from string import letters

low = 1e2     # keys are 3- to 11-digit numbers
hi = 1e11
cnt = 100000  # 100,000 entries

total = {}
for _ in range(cnt):
    total[str(random.randrange(low, hi))] = "".join(random.sample(letters, 10))

with open("data.json", "w") as f:
    f.write(json.dumps(total, ensure_ascii=False))

with open("data.csv", "w") as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(total.items())
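(The scripts in this answer are Python 2. In case anyone wants to reproduce this on Python 3, a rough sketch of the same data generation is below; the only real changes are string.ascii_letters in place of string.letters and newline='' when opening the CSV file.)

# coding: utf-8
# Rough Python 3 sketch of the data generation above (stdlib only).
import json
import csv
import random
from string import ascii_letters  # string.letters no longer exists in Python 3

low = int(1e2)    # keys are 3- to 11-digit numbers
hi = int(1e11)
cnt = 100000      # 100,000 entries

total = {}
for _ in range(cnt):
    total[str(random.randrange(low, hi))] = "".join(random.sample(ascii_letters, 10))

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(total, f, ensure_ascii=False)

with open("data.csv", "w", newline="") as f:   # newline="" keeps csv from adding blank lines on Windows
    writer = csv.writer(f, delimiter=",")
    writer.writerows(total.items())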
Then compare how quickly each of the two files can be loaded back into a dict:
# coding: utf-8
import json
import csv
from time import clock

t0 = clock()
total1 = json.load(open("data.json"))
t1 = clock()

total2 = {}
with open("data.csv") as f:
    reader = csv.reader(f)
    for k, v in reader:
        total2[k] = v
t2 = clock()

print "json: %fs" % (t1 - t0)
print "csv: %fs" % (t2 - t1)
The output is:
json: 0.109953s
csv: 0.066411s
So CSV is indeed quite a bit faster; let's go with it.
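(The numbers above come from the Python 2 script. On Python 3, time.clock has been removed since 3.8 and print is a function, so the benchmark itself would need a small rewrite as well; a sketch using time.perf_counter:)

# coding: utf-8
# Rough Python 3 sketch of the benchmark above, using perf_counter instead of clock.
import json
import csv
from time import perf_counter

t0 = perf_counter()
with open("data.json", encoding="utf-8") as f:
    total1 = json.load(f)
t1 = perf_counter()

total2 = {}
with open("data.csv", newline="") as f:
    for k, v in csv.reader(f):
        total2[k] = v
t2 = perf_counter()

print("json: %fs" % (t1 - t0))
print("csv: %fs" % (t2 - t1))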
Next, the update problem. I don't know how the asker wants duplicate keys handled, so both options are shown below.
# Generate the new data first, same approach as before.
low = 1e2
hi = 1e11
cnt = 100000

new = {}
for _ in range(cnt):
    new[str(random.randrange(low, hi))] = "".join(random.sample(letters, 10))

# Find the duplicate keys; since the data is randomly generated,
# it happens that there are none this time.
duplicate = {k: v for k, v in new.items() if k in total}

# Print the duplicates
print(json.dumps(duplicate, ensure_ascii=False, indent=4))

# 1. If duplicates should be overwritten by the values in new:
total.update(new)

# 2. If the values already in total should be kept instead:
new.update(total)
total = new

# Then write everything back to the CSV file.
with open("data.csv", "w") as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(total.items())
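(To make the difference between the two variants concrete, here is a tiny demo with hypothetical two-entry dicts; only the value of the duplicate key "123" differs.)

# Which variant you want depends on whose value should win for a duplicate key.
total = {"123": "old", "456": "keep"}
new = {"123": "fresh", "789": "extra"}

# Variant 1: new overwrites total
merged1 = dict(total)
merged1.update(new)
# merged1 == {"123": "fresh", "456": "keep", "789": "extra"}

# Variant 2: total's values win
merged2 = dict(new)
merged2.update(total)
# merged2 == {"123": "old", "456": "keep", "789": "extra"}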
As for the running time: excluding the time spent printing the duplicates, it's under 0.5 s; including that, it's roughly 0.8 s.
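(If you want to check a figure like that yourself, here is a self-contained sketch in the same Python 2 style as above. It assumes data.csv from the first step already exists, and it also re-reads that file, so it measures slightly more than the merge-and-rewrite alone.)

# coding: utf-8
# Timing sketch: load the existing data.csv, merge a fresh batch of random
# entries (new overwrites duplicates, i.e. variant 1), and write it back.
import csv
import random
from string import letters
from time import clock

t0 = clock()

total = {}
with open("data.csv") as f:
    for k, v in csv.reader(f):
        total[k] = v

new = {}
for _ in range(100000):
    new[str(random.randrange(1e2, 1e11))] = "".join(random.sample(letters, 10))

total.update(new)

with open("data.csv", "w") as f:
    csv.writer(f, delimiter=',').writerows(total.items())

t1 = clock()
print "load + merge + rewrite: %fs" % (t1 - t0)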