To refresh pandas knowledge. To do several groupbys and joins to solve the task.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
%matplotlib inline
from grader import Grader
DATA_FOLDER = '../readonly/final_project_data/'
transactions = pd.read_csv(os.path.join(DATA_FOLDER, 'sales_train.csv.gz'))
items = pd.read_csv(os.path.join(DATA_FOLDER, 'items.csv'))
item_categories = pd.read_csv(os.path.join(DATA_FOLDER, 'item_categories.csv'))
shops = pd.read_csv(os.path.join(DATA_FOLDER, 'shops.csv'))
grader = Grader()
print(np.shape(transactions))
transactions.head()

0. Print the shape of the loaded dataframes and use df.head function to print several rows. Examine the features you are given.

1
2
print(np.shape(transactions))
transactions.head()

1. What was the maximum total revenue among all the shops in September, 2014?

  • Hereinafter revenue refers to total sales minus value of goods returned.

Hints:

  • Sometimes items are returned, find such examples in the dataset.
  • It is handy to split date field into [day, month, year] components and use df.year == 14 and df.month == 9 in order to select target subset of dates.
  • You may work with date feature as with srings, or you may first convert it to pd.datetime type with pd.to_datetime function, but do not forget to set correct format argument.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
transactions["date"] = pd.to_datetime(transactions["date"],format = "%d.%m.%Y")
transactions["year"] = transactions.date.dt.year
transactions["month"] = transactions.date.dt.month
transactions["day"] = transactions.date.dt.day
transactions.drop('date', axis=1, inplace=True)
shortlist = transactions[(transactions.year == 2014) & (transactions.month == 9)]
shortlist = shortlist.assign(dailygain = lambda x: x.item_cnt_day * x.item_price)
revenuelist = np.array([])
for i in shortlist.shop_id:
revenuelist = np.append(revenuelist,shortlist[(shortlist.shop_id == i)].dailygain.sum())
answer_1 = revenuelist.max() #7982852.1999999564
max_revenue = 7982852.1999999564
grader.submit_tag('max_revenue', max_revenue)

2. What item category generated the highest revenue in summer 2014?

  • Submit id of the category found.

  • Here we call “summer” the period from June to August.

Hints:

  • Note, that for an object x of type pd.Series: x.argmax() returns index of the maximum element. pd.Series can have non-trivial index (not [1, 2, 3, ... ]).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
shortlist2 = transactions[(transactions.year == 2014) & ((transactions.month == 7)|(transactions.month == 6)|(transactions.month == 8))]
items = items.drop('item_name',1)
shortlist2full = pd.merge(shortlist2,items, how='left', on=['item_id'])
shortlist2full = shortlist2full.assign(dailygain = lambda x: x.item_cnt_day * x.item_price)
grouped = shortlist2full['dailygain'].groupby(shortlist2full['item_category_id'])
grouped.sum().argmax() #20
category_id_with_max_revenue = 20 # PUT YOUR ANSWER IN THIS VARIABLE
grader.submit_tag('category_id_with_max_revenue', category_id_with_max_revenue)

3. How many items are there, such that their price stays constant (to the best of our knowledge) during the whole period of time?

  • Let’s assume, that the items are returned for the same price as they had been sold.
1
2
3
4
5
6
7
8
quest3data = transactions[transactions.dailygain>0]
quest3data.head()
group3 = quest3data['item_price'].groupby(transactions['item_id'])
mylist = group3.nunique()
a = np.sum(mylist==1)# 5926
num_items_constant_price = a
grader.submit_tag('num_items_constant_price', num_items_constant_price)

4. What was the variance of the number of sold items per day sequence for the shop with shop_id = 25 in December, 2014? Do not count the items, that were sold but returned back later.

  • Fill total_num_items_sold and days arrays, and plot the sequence with the code below.
  • Then compute variance. Remember, there can be differences in how you normalize variance (biased or unbiased estimate, see link). Compute unbiased estimate (use the right value for ddof argument in pd.var or np.var).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
shop_id = 25
shortlist3 = transactions[(transactions.shop_id == shop_id) & (transactions.year == 2014) & (transactions.month == 12)]
#不需要筛选里添加 & (transactions.item_cnt_day>0)
list3 = shortlist3["item_cnt_day"].groupby(shortlist3['day']).sum().reset_index()
total_num_items_sold = list3.item_cnt_day# YOUR CODE GOES HERE
days = list3.day# YOUR CODE GOES HERE
# Plot it
plt.plot(days, total_num_items_sold)
plt.ylabel('Num items')
plt.xlabel('Day')
plt.title("Daily revenue for shop_id = 25")
plt.show()
total_num_items_sold_var = np.var(list3.item_cnt_day, ddof = 1)# PUT YOUR ANSWER IN THIS VARIABLE
grader.submit_tag('total_num_items_sold_var', total_num_items_sold_var)